ABSTRACT


Human-computer dialogue has recently attracted extensive attention from both academia and industry
as an important branch of artificial intelligence (AI). However, there are few studies on the
evaluation of large-scale Chinese human-computer dialogue systems. In this paper, we introduce the Second
Evaluation of Chinese Human-Computer Dialogue Technology, which focuses on the identification of a user’s
intents and intelligent processing of intent words. The Evaluation consists of user intent classification (Task 1)
and online testing of task-oriented dialogues (Task 2), the data sets of which are provided by iFLYTEK
Corporation. We introduce the evaluation tasks and data sets in detail, and discuss the evaluation
results and the existing problems in the evaluation.


1. INTRODUCTION 


With the development of artificial intelligence, human-computer dialogue technology has become
increasingly popular and has attracted growing attention [1]. Human-computer dialogue systems are
conversational agents, which are normally divided into two classes [2, 3]: task-oriented dialogue systems
[4, 5, 6] and non-task-oriented systems [7, 8]. In this paper, we mainly focus on task-oriented dialogue
systems.


† Corresponding author: Zhengyu Zhao (Email: zyzhao@ir.hit.edu.cn; ORCID: 0000-0003-1678-9694).


There are two important tasks in a task-oriented dialogue system. One is the classification
of a user’s intents, which is a text categorization task. Its purpose is to recognize the user's chat intentions,
such as task-based interaction, a knowledge quiz or chit-chat. It is the foundation for building a large and
complex human-machine dialogue system [9]. The task is clearly defined but difficult, because few corpora
are available for training the algorithms and because semantic meaning is hard to capture.
Recently, there have been some evaluations with user intent classification tasks. For example, Task 1 in the
17th China National Conference on Computational Linguistics (CCL2018)①, which is based on Chinese
corpora, is a user intent classification task in the customer service field. The organizers provide some open
data to allow participants to build systems and then test them on hidden data sets. However, the data sets
they provide are limited to Q&A data from China Mobile Communications Group Co., Ltd.②, covering
query categories, data processing categories and business consulting categories.


The other is to accomplish tasks in a specific domain through human-computer dialogue. A complete
human-computer dialogue system should be capable of understanding the tasks that users want to accomplish and
assisting them in completing a specific domain task, such as inquiring about train information or booking a ticket.
This is a fairly complex task, which can fully reflect the intelligence of a human-machine dialogue system.
Another challenge is how to evaluate and compare these systems, and which influencing factors we need
to pay attention to. A similar evaluation based on English corpora is the 6th Dialogue System Technology
Challenges (DSTC6) held in 2017 [10]. In DSTC6, participants need to build a system that responds to a
user’s utterances based on the context of the conversation, and they may use external data. Both objective
and subjective indicators are used to evaluate the submitted systems [11]. However, the focus of the task
for participants in DSTC6 is on text generation instead of the complete process of accomplishing the given
task. As far as we know, the last manual evaluation of an end-to-end task-based dialogue system was the
Spoken Dialogue Challenge 2010 [12], which was held eight years ago.


In short, in order to promote the development of evaluation technology for human-computer dialogue
systems, and to draw more attention to the above two key issues in human-computer
dialogue systems, the Second Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) was held
during the 7th China National Conference on Social Media Processing③. It consists
of two tasks:


1) User intent classification. There are 31 categories in total, which include one chit-chat category and
30 vertical categories for 30 specific tasks such as accessing apps and inquiring about the weather.
The submitted systems need to determine which of the 31 categories the user’s input belongs to.

2) Online testing of task-oriented dialogues. The submitted systems should complete the corresponding
tasks of ticket inquiry or reservation through online real-time dialogues with testers.


① http://www.cips-cl.org/static/CCL2018/call-evaluation.html
② http://www.10086.cn
③ http://smp2018.cips-smp.org/

This Evaluation includes both automatic evaluation (for user intent classification) and online
manual testing (for task-oriented dialogues). Compared with CCL2018 Task 1 and DSTC6,
this Evaluation has the following features:


• Compared to CCL2018 Task 1, our data set contains a more comprehensive and more general set of
labels, not limited to one domain. Specifically, we provide a data set
which contains 31 user intents that appear frequently in general-purpose chatbots.

• Compared to DSTC6, we select several reviewers to evaluate the complete process of accomplishing
a given task, and the reviewers score each submitted system during each process.

• In order to avoid revealing the hidden test set and thereby reduce the possibility of manual
intervention, we have modified the traditional evaluation method: the participating teams
set up services that respond to our requests, so that they do not have to submit their code.
At the same time, in order to prevent participants from obtaining the complete test set, we add a lot of
noise to the test set.
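
The service-based protocol in the last item can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `build_request_stream` and `score_responses`, the `classify` callable that stands in for a call to a participant's service, and the way decoy queries are mixed in are all hypothetical. The paper does not describe how the noise was generated or how requests were sent.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_request_stream(
    real_items: List[Tuple[str, str]],  # (query, gold label) pairs from the hidden test set
    noise_queries: List[str],           # decoy queries that are never scored
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Mix real and decoy queries so a service cannot tell which ones count."""
    stream = [(q, True) for q, _ in real_items] + [(q, False) for q in noise_queries]
    random.Random(seed).shuffle(stream)
    return stream


def score_responses(
    real_items: List[Tuple[str, str]],
    stream: List[Tuple[str, bool]],
    classify: Callable[[str], str],     # hypothetical stand-in for the remote service
) -> Dict[str, Tuple[str, str]]:
    """Send every request, but keep (gold, predicted) pairs only for real items."""
    gold = dict(real_items)
    results = {}
    for query, is_real in stream:
        predicted = classify(query)     # decoys are sent too, then discarded
        if is_real:
            results[query] = (gold[query], predicted)
    return results


# Toy data (made up for illustration, not taken from the evaluation).
real = [("open chrome", "app"), ("tell me a novel", "novel")]
noise = ["asdf qwer", "hello there"]
stream = build_request_stream(real, noise)
results = score_responses(real, stream, classify=lambda q: "app")
```

Under this sketch, a participant who logs every incoming request still cannot separate scored queries from decoys, which is the property the noise is meant to provide.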


In addition, compared to SMP2017-ECDT [13], this year we add new data sets for each of the two tasks.
Our data sets, provided by iFLYTEK Corporation④, are all labeled manually. Unlike Task 1 last year,
we have dropped the closed domain evaluation and continue the open domain evaluation only. The
difference between the two is that, in the open domain, participants can not only use the provided
training data but also collect data by themselves. In any case, there is no guarantee that
the participating teams will use only the evaluation data provided by us for training and developing their
systems if we do not ask them to provide the code.


The rest of the paper is organized as follows. We introduce the two tasks in detail in Section 2 and describe
the data sets of the two tasks in Section 3. Part of the evaluation results is given in Section 4 and finally the
conclusion is drawn in Section 5.


2. THE SECOND EVALUATION OF CHINESE HUMAN-COMPUTER DIALOGUE TECHNOLOGY 


In this section, we give a brief introduction to the evaluation tasks.


2.1 Task 1: User Intent Classification


The specific descriptions of Task 1 are as follows: build a system that can classify a user's input into the 
most relevant category, including chit-chat or task subcategories, e.g., 


• What have you done recently? - chat

• What's the big news? - news

• I want to read free novels. - novel


④ http://www.iflytek.com/


In Task 1, participating teams do not need to consider the overall intention of multiple rounds of a
task-based dialogue, but only need to pay attention to a single round of dialogue. In addition, they are
provided with a template of an example system⑤ to facilitate the unification of the interface.


Many text categorization tasks use the F1-measure as the evaluation indicator, such as [14, 15,
16]. To reduce the effect of the imbalanced category distribution while taking every
category into account, we also evaluate submitted systems with the F1-measure obtained from precision and recall.
Specifically, we first construct a confusion matrix for calculating the precision $P_i$ and recall $R_i$ of each
category, and then take the average precision as $P = \frac{1}{N}\sum_{i=1}^{N} P_i$ and the average recall as
$R = \frac{1}{N}\sum_{i=1}^{N} R_i$. The F1-measure is calculated by Equation (1):

$$F_1 = \frac{2PR}{P + R}, \qquad (1)$$

where N denotes the total number of categories.
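
The computation in Equation (1) can be illustrated with a short sketch. The function name `macro_f1`, the row/column convention of the confusion matrix, and the choice to count a category with no predictions or no samples as 0 are assumptions made for this example; the toy counts are invented and are not evaluation data. Note that F1 here is computed from the averaged P and R, which is what Equation (1) states, and not as the average of per-category F1 values.

```python
from typing import List, Tuple


def macro_f1(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """Return (P, R, F1); confusion[i][j] counts items of true class i predicted as class j."""
    n = len(confusion)
    precisions, recalls = [], []
    for i in range(n):
        true_pos = confusion[i][i]
        predicted_i = sum(confusion[r][i] for r in range(n))  # column sum
        actual_i = sum(confusion[i])                          # row sum
        # Assumption: an empty column or row contributes 0 to the average.
        precisions.append(true_pos / predicted_i if predicted_i else 0.0)
        recalls.append(true_pos / actual_i if actual_i else 0.0)
    p = sum(precisions) / n
    r = sum(recalls) / n
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy 3-class confusion matrix (made-up counts).
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
P, R, F1 = macro_f1(toy)
```

For the toy matrix, the per-category precisions are 8/10, 6/7 and 10/13, and the per-category recalls are 0.8, 0.6 and 1.0, so R is exactly 0.8 and P is their respective mean.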


2.2 Task 2: Online Testing of Task-Oriented Dialogues 


Task 2 of the Evaluation is described as follows: for a complex task such as booking a flight, a train ticket,
or a hotel room, build a system that guides the user to complete the corresponding task based on the given
relevant database. In this evaluation, we evaluate submitted systems online and manually. Research in [17]
suggests that crowdsourcing is feasible and can provide reliable results; our
reviewers are professional testers from iFLYTEK Corporation, who are more likely to produce accurate
results. A complete intent of a flight reservation task is described as:


“Booking a flight from Beijing to Shanghai in the morning or at noon”.


The whole dialogue process of this flight reservation task is shown in Table 1, where U denotes the 
utterance of the user and R denotes the response of the agent. 


⑤ https://github.com/WindInWillows/SMP2018-ECDT-TASK1


Table 1. An example of air ticket booking.


Role  Questions and answers
U     Check out the ticket from Beijing to Shanghai tomorrow.
R     Do you only need an air ticket?
U     Yes!
R     When are you leaving tomorrow?
U     Morning or noon.
R     The following is the ticket information for you to check, would you like to book a ticket?
U     OK, I'll take it.
R     The flight ticket has already been booked for you. Now we go to pay for the ticket!


Considering a variety of important factors on evaluation of a task-oriented dialogue system, we use the 
following indicators to evaluate the submitted systems in Task 2: 


• Task completion ratio: The number of tasks completed during the test divided by the total number
of tasks.

• Average number of dialogue turns: The number of utterances during the process of completing a
task.

• Satisfaction score: The subjective score of the system marked by the tester, an integer from
-2 to 2 (five levels).

• Fluency degree of response: Subjective scoring, an integer from -1 to 1 (three levels).

• Uncovered data guidance capability: Subjective scoring, either 0 or 1.


The core purpose of a task-oriented dialogue system is to help users complete a specific task. Accordingly, the
two most direct indicators for evaluating a task-oriented dialogue system are the task completion ratio and
the average number of dialogue turns [18, 19]. The task completion ratio indicates how many tasks are
completed and is the most important indicator of the system’s capabilities. In Task 2, a complete
intent may contain multiple subtasks, such as booking a flight first, then booking a train ticket, and
finally booking a hotel room. In order to demonstrate the ability of the system to complete composite tasks,
we mark a composite task as completed only when all of its subtasks are completed. The
average number of dialogue turns is counted by the evaluation system. When the task completion ratio
is the same, the smaller the number of dialogue turns, the better the system performs. In order to ensure
that the number of dialogue turns of an unfinished subtask is greater than or equal to that
of a completed subtask, we take the number of dialogue turns of an unfinished subtask to be the theoretical
maximum number. If the maximum number of turns is exceeded during the test, the current round of
testing is terminated. The remaining indicators are the subjective scores of the three reviewers, all of
which are average scores. They reflect the performance of the dialogue system in the three respective
aspects.
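
As an illustration of how the five indicators could be aggregated, the sketch below averages them over a list of test cases. It is a simplification under stated assumptions: the `TestCase` record, its field names, and the `summarize` function are hypothetical, and the theoretical maximum is applied per test case here, whereas the text above applies it per unfinished subtask. The example values are invented.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class TestCase:
    completed: bool          # True only if every subtask was completed
    turns: int               # turns counted by the evaluation system
    max_turns: int           # theoretical maximum for this test case (assumed known)
    satisfaction: List[int]  # one score per reviewer, each in -2..2
    fluency: List[int]       # one score per reviewer, each in -1..1
    guidance: List[int]      # one score per reviewer, each 0 or 1


def summarize(cases: List[TestCase]) -> dict:
    """Average the five Task 2 indicators over all test cases."""
    return {
        "C": mean(1.0 if c.completed else 0.0 for c in cases),
        # An unfinished case is charged the theoretical maximum number of turns.
        "T": mean(c.turns if c.completed else c.max_turns for c in cases),
        "Sa": mean(mean(c.satisfaction) for c in cases),
        "F": mean(mean(c.fluency) for c in cases),
        "G": mean(mean(c.guidance) for c in cases),
    }


# Two invented test cases, each scored by three reviewers.
cases = [
    TestCase(True, 10, 30, [2, 1, 2], [1, 1, 0], [1, 1, 1]),
    TestCase(False, 18, 30, [-1, 0, -2], [0, -1, -1], [0, 1, 0]),
]
summary = summarize(cases)
```

In this toy run the unfinished case contributes 30 turns instead of the 18 actually used, so a system cannot lower T by abandoning a dialogue early.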


The test method of this evaluation is applicable not only to the Chinese Human-Computer
Dialogue Technology Evaluation but also to the same evaluation tasks in other languages,
with little modification other than replacing the corpus.


3. EVALUATION OF DATA SETS 


The evaluation data set⑥ in Task 1 is provided by iFLYTEK Corporation and is labeled manually in full.
Some specific examples of this data set are shown in Table 2. There are 31 categories of intent data and
Table 3 shows how the data set is divided.


Table 2. Some examples in training set of Task 1.


Input message (English translation)      Intent category
Tell me a novel.                         Novel
What have you done recently?             Chat
Open Chrome browser.                     App
Write an email for me.                   Email
Call my brother.                         [illegible in source]
How about the stock of Bank of China?    Stock


Table 3. Statistics of the intent data set in Task 1.


         Train    Dev*    Test
Count    2,299    770     1,550


Note: * “Dev” refers to “development set”.


The data set of Task 2 contains information on flights, train tickets and hotels. It mainly includes the
origin and destination of the flight or the train, the departure time and arrival time, the price, the type of
tickets of the flight or the train, and the price and location of the hotel. Participants need to build a
task-based dialogue system based on this information. In addition, we provide testers with some test cases and
corresponding starting sentences that contain individual intentions and mixed intentions for tasks on
booking air tickets, train tickets and hotels.


⑥ The data set in Task 1 is available at https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/.


4. EVALUATION RESULTS


In this section, we show partial results of Task 1 and Task 2. Meanwhile, we analyze the results and 
summarize some frequently occurring problems of the two tasks. The complete leaderboards are shown in 
Appendix A. 


4.1 Task 1 


For Task 1, we received 21 submitted systems in total; part of the evaluation results is shown
in Table 4.


Table 4. The top 8 teams of Task 1 ranked by F1 score.


Ranking  Participant                                                      F1 score
1        CloudMinds (Beijing)                                             0.8339
2        iDeepWise Artificial Intelligence (Beijing)                      0.8276
3        ABitAI Technology Co., Ltd.                                      0.8008
4        Spoken Dialogue System Lab, South China Agricultural University  0.7923
5        Laiye Network Technology Co., Ltd.                               0.7846
6        School of Computer & Information Technology, Shanxi University   0.7735
7        Tongji University                                                0.7722
8        Shanxi University                                                0.7648


After evaluating and ranking the submitted systems, we find that the average F1 score of the
top five entries in this year’s competition (0.8079) is much lower than that of last year (0.9268). The main
reason is perhaps that this year's test set is completely new and was created later than the training set and the
development set, so its distribution differs from theirs.
Therefore, models trained on the training set perform worse on this year’s test set than
on last year’s. This also indicates that many current text classification models
lose considerable accuracy when transferred to data from a different distribution.
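
The reported top-five average can be reproduced with a few lines of arithmetic. The sketch below uses the six-decimal F1 scores listed for the first five entries in Table A1; the variable names are chosen for this example only.

```python
# Six-decimal F1 scores of the top five Task 1 entries, copied from Table A1.
top_five = [0.833949, 0.827594, 0.800823, 0.792296, 0.784645]
average = sum(top_five) / len(top_five)

# Gap to last year's top-five average as reported in the text.
gap = 0.9268 - average

print(f"{average:.4f}")  # prints 0.8079
```

The gap to last year's figure is therefore about 0.119 in absolute F1.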


4.2 Task 2 


Since Task 2 is much more difficult and complex than Task 1, the number of submitted systems is
relatively small. A total of 10 systems were submitted in Task 2; the top five are shown in Table 5. The main
reference indicators are C (task completion ratio) and T (the average number of dialogue turns: the smaller
the score T, the better the system). In this task, 34.29 is the theoretical maximum of T, and this maximum
penalty is applied when C is zero.


Table 5. The top 5 teams of Task 2.


Ranking  Participant                                     C      T      Sa      F       G
1        iDeepWise Artificial Intelligence               0.397  26.13  0.667   0.333   0.762
2        Centaurs Technologies Co., Ltd.                 0.349  21.86  0.429   0.286   0.714
3        CIKE Lab, South China University of Technology  0.270  28.73  0.238   -0.064  0.524
4        BatOrange Interactive Technology Co., Ltd.      0.222  30.32  0.556   0.349   0.698
5        Laiye Network Technology Co., Ltd.              0.127  31.84  -0.222  -0.238  0.191


Note: C denotes task completion ratio, T denotes average dialogue turns, Sa denotes user satisfaction score, F denotes fluency 
degree of response, and G denotes uncovered data guidance capability. All these indicators are average scores of all test cases. 


The results shown in Table 5 are ranked first by C, and then by T, Sa, F and G in that order. Among
these indicators, C, Sa, F and G are manually labeled and T is calculated by the evaluation system.
Three reviewers score each test case for each participating system. The final score for each indicator
is the average of its scores over all test cases, given by the reviewers or the evaluation system.
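
The ranking rule can be written as a multi-key sort. The sketch below is an illustration, not the organizers' code: the `Entry` type and `rank` function are hypothetical, and treating higher Sa, F and G as better is an assumption (the text only states explicitly that a smaller T is better). The rows are entries 6 to 10 of Table A2, listed out of order on purpose.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    team: str
    C: float   # task completion ratio
    T: float   # average dialogue turns
    Sa: float  # satisfaction score
    F: float   # fluency degree of response
    G: float   # uncovered data guidance capability


def rank(entries: List[Entry]) -> List[Entry]:
    # C descending, then T ascending (fewer turns is better),
    # then Sa, F, G descending (assumed: higher is better).
    return sorted(entries, key=lambda e: (-e.C, e.T, -e.Sa, -e.F, -e.G))


rows = [
    Entry("NC&IS, Peking University", 0.0000, 34.29, -1.968, -1.000, 0.000),
    Entry("KELAB, Fudan University", 0.0159, 34.29, -0.921, -0.619, 0.064),
    Entry("Little Tiger, Shanxi University", 0.0000, 34.29, -0.984, -0.508, 0.444),
    Entry("Shanxi University", 0.0159, 33.11, -0.825, -0.492, 0.286),
    Entry("NLP Group, Northwest Normal University", 0.0000, 34.29, -1.825, -0.952, 0.032),
]
ranked = [e.team for e in rank(rows)]
```

With these rows, the tie on C = 0.0159 is resolved by T, and the three entries with C = 0 and T = 34.29 are separated by Sa, which reproduces the order given in Table A2.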


4.3 Analysis 


According to the results, this evaluation was completed smoothly. Each participating team
verified its system on the provided data set and achieved results consistent with its
expectations. Through this evaluation, some key problems in human-computer dialogue have attracted
more attention. In addition, this evaluation mainly focuses on the application of human-computer
dialogue systems, so it provides a reference for industry in constructing such systems.
Meanwhile, we observe an interesting phenomenon in the
evaluation results: the top three teams in the two tasks are almost all from industry, which demonstrates
the importance of experience in natural language processing evaluation tasks.


5. CONCLUSION 


We have introduced the Second Evaluation of Chinese Human-Computer Dialogue Technology, which
made some adjustments and improvements to address the problems of the first evaluation in
2017. In this paper, we describe Task 1 and Task 2 of this Evaluation, and explain the updated
indicators of the two tasks and how they are calculated. In addition, we describe the data sets of
the two tasks. Finally, we present the evaluation results and analyze the problems in the evaluation.


194 Data Intelligence 


202211.00435v1 


chinaXiv 


ChinaXiv ERAT 
An Evaluation of Chinese Human-Computer Dialogue Technology 


AUTHOR CONTRIBUTIONS 


This work was a collaboration between all of the authors. W. Zhang (wnzhang@ir.hit.edu.cn) is the leader
of SMP2018-ECDT, who drew the whole picture of the evaluation. W. Che (car@ir.hit.edu.cn), Z. Chen
(zgchen@iflytek.com) and Y. Zhang (yibo.cheung@huawei.com) supervised the evaluation process and
summarized the conclusion part of this paper. Z. Zhao (zyzhao@ir.hit.edu.cn) summarized the data sets
and results of SMP2018-ECDT and drafted the paper. All the authors have made meaningful and valuable
contributions in revising and proofreading the resulting manuscript.


ACKNOWLEDGEMENTS 


We would like to thank the Social Media Processing Committee of the Chinese Information Processing Society
of China (CIPS-SMP) for its strong support for this evaluation. We especially thank Wenxia Feng,
Xinyi Chen, and Shu Fang, the conscientious and responsible testers from iFLYTEK Corporation
who completed the online evaluation of Task 2 patiently and impartially. Thanks to Huawei
Technologies Co. Ltd. for providing financial support of RMB 50,000 as a bonus for this
evaluation. Thanks to Lingzhi Li, Caihai Zhu, Yiming Cui, Haoyu Song and Yuanxing Liu for their indispensable
support during the evaluation.


REFERENCES 


[1] I.V. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, ... & Y. Bengio. A deep
reinforcement learning chatbot. arXiv preprint. arXiv:1709.02349, 2017.

[2] X. Wang, & C. Yuan. Recent advances on human-computer dialogue. CAAI Transactions on Intelligence
Technology 1(4)(2016), 303-312. doi: 10.1016/j.trit.2016.12.004.

[3] H. Chen, X. Liu, D. Yin, & J. Tang. A survey on dialogue systems: Recent advances and new frontiers. arXiv
preprint. arXiv:1711.01731, 2018.

[4] L. Cui, S. Huang, F. Wei, C. Tan, C. Duan, & M. Zhou. Superagent: A customer service chatbot for ecommerce
websites. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics-System
Demonstrations, 2017, pp. 97-102. doi: 10.18653/v1/P17-4017.

[5] B. Liu, G. Tur, D. Hakkani-Tur, P. Shah, & L. Heck. Dialogue learning with human teaching and feedback in
end-to-end trainable task-oriented dialogue systems. In: The 16th Annual Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2060-2069.
Available at: http://www.aclweb.org/anthology/N18-1187.

[6] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, ... & D. Yu. Using recurrent neural
networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and
Language Processing 23(3)(2015), 530-539. doi: 10.1109/TASLP.2014.2383614.

[7] R. Yan, & D. Zhao. Coupled context modeling for deep chit-chat: Towards conversations between human and
computer. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, 2018, pp. 2574-2583. doi: 10.1145/3219819.3220045.

[8] I.V. Serban, A. Sordoni, Y. Bengio, A.C. Courville, & J. Pineau. Building end-to-end dialogue systems using
generative hierarchical neural network models. In: Proceedings of the 30th AAAI Conference on Artificial
Intelligence, 2016, pp. 3776-3784. Available at: https://dl.acm.org/citation.cfm?id=3016435.


[9] A. Bhardwaj, & A. Rudnicky. User intent classification using memory networks: A comparative analysis for a
limited data scenario. arXiv preprint. arXiv:1706.06160, 2017.

[10] DSTC6: Dialogue System Technology Challenges. Available at: http://workshop.colips.org/dstc6/.

[11] C. Hori, & T. Hori. End-to-end conversation modeling track in DSTC6. arXiv preprint. arXiv:1706.07440,
2017.

[12] A.W. Black, S. Burger, B. Langner, G. Parent, & M. Eskenazi. Spoken Dialogue Challenge 2010. In: 2010 IEEE
Spoken Language Technology Workshop, 2010, pp. 448-453. doi: 10.1109/SLT.2010.5700894.

[13] W. Zhang, Z. Chen, W. Che, G. Hu, & T. Liu. The first Evaluation of Chinese Human-Computer
Dialogue Technology. arXiv preprint. arXiv:1709.10217, 2017.

[14] G. Chen, D. Ye, Z. Xing, J. Chen, & E. Cambria. Ensemble application of convolutional and recurrent neural
networks for multi-label text categorization. In: International Joint Conference on Neural Networks (IJCNN),
2017, pp. 2377-2383. doi: 10.1109/IJCNN.2017.7966144.

[15] B. Tang, S. Kay, & H. He. Toward optimal feature selection in Naive Bayes for text categorization. arXiv
preprint. doi: 10.1109/TKDE.2016.2563436.

[16] F. Rousseau, E. Kiagias, & M. Vazirgiannis. Text categorization as a graph classification problem. In:
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing, 2015, pp. 1702-1712. doi: 10.3115/v1/P15-1164.

[17] F. Jurcicek, S. Keizer, M. Gasic, F. Mairesse, B. Thomson, K. Yu, & S. Young. Real user evaluation of spoken
dialogue systems using Amazon Mechanical Turk. In: Proceedings of the Annual Conference of the
International Speech Communication Association, 2011, pp. 3061-3064. Available at:
http://mi.eng.cam.ac.uk/~sjy/papers/jkgm11.pdf.

[18] P.-H. Su, M. Gasic, N. Mrksic, L. Rojas-Barahona, S. Ultes, D. Vandyke, T.-H. Wen, & S. Young. On-line
active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint. arXiv:1605.07669,
2016.

[19] A.W. Black, S. Burger, A. Conkie, H. Hastie, S. Keizer, O. Lemon, N. Merigaud, ... & M. Eskenazi. Spoken
Dialogue Challenge 2010: Comparison of live and control test results. In: Proceedings of the SIGDIAL2011
Conference, 2011, pp. 2-7. Available at: https://dl.acm.org/citation.cfm?id=2132892.


APPENDIX A: COMPLETE LEADERBOARD 


Table A1. The complete leaderboard of Task 1 ranked by F1 score.


Ranking  Participant                                                      F1 score
1        CloudMinds (Beijing)                                             0.833949
2        iDeepWise Artificial Intelligence (Beijing)                      0.827594
3        ABitAI Technology Co., Ltd.                                      0.800823
4        Spoken Dialogue System Lab, South China Agricultural University  0.792296
5        Laiye Network Technology Co., Ltd.                               0.784645
6        School of Computer & Information Technology, Shanxi University   0.773488
7        Tongji University                                                0.772231
8        Shanxi University                                                0.764794
9        CIKE Lab, South China University of Technology                   0.762546
10       Harbin Institute of Technology                                   0.759060
11       NLP Lab1, Guangdong University of Foreign Studies                0.748618
12       BatOrange Interactive Technology Co., Ltd.                       0.742506
13       NC&IS, Peking University                                         0.742133
14       NLP Lab2, Guangdong University of Foreign Studies                0.729600
15       ZhongAn Technology                                               0.725358
16       NLP Group, Northwest Normal University                           0.720373
17       DeepBrain                                                        0.714655
18       KELAB, Fudan University                                          0.692646
19       Harbin Institute of Technology, Shenzhen                         0.682747
20       NLP Lab, Zhengzhou University                                    0.496503
21       Little Tiger, Shanxi University                                  0.187605


Table A2. The complete leaderboard of Task 2.


Ranking  Participant                                     C       T      Sa      F       G
1        iDeepWise Artificial Intelligence               0.3970  26.13  0.667   0.333   0.762
2        Centaurs Technologies Co., Ltd.                 0.3490  21.86  0.429   0.286   0.714
3        CIKE Lab, South China University of Technology  0.2700  28.73  0.238   -0.064  0.524
4        BatOrange Interactive Technology Co., Ltd.      0.2220  30.32  0.556   0.349   0.698
5        Laiye Network Technology Co., Ltd.              0.1270  31.84  -0.222  -0.238  0.191
6        Shanxi University                               0.0159  33.11  -0.825  -0.492  0.286
7        KELAB, Fudan University                         0.0159  34.29  -0.921  -0.619  0.064
8        Little Tiger, Shanxi University                 0.0000  34.29  -0.984  -0.508  0.444
9        NLP Group, Northwest Normal University          0.0000  34.29  -1.825  -0.952  0.032
10       NC&IS, Peking University                        0.0000  34.29  -1.968  -1.000  0.000


Note: C denotes task completion ratio, T denotes average dialogue turns, Sa denotes user satisfaction score, F denotes fluency
degree of response, and G denotes uncovered data guidance capability. All these indicators are average scores of all test cases.